import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')Project Code
GGAP
0.1 Data quality check / cleaning / preparation
Put code with comments. The comments should explain the code such that it can be easily understood. You may put text (in a markdown cell) before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. Put the name of the person / persons who contributed to each code chunk / set of code chunks. An example is given below.
raw_crime_data = pd.read_csv("crimes_-_2022.csv")
raw_crime_data.shape(239043, 22)
raw_crime_data.isnull().sum()ID 0
Case Number 0
Date 0
Block 0
IUCR 0
Primary Type 0
Description 0
Location Description 881
Arrest 0
Domestic 0
Beat 0
District 0
Ward 10
Community Area 0
FBI Code 0
X Coordinate 4107
Y Coordinate 4107
Year 0
Updated On 0
Latitude 4107
Longitude 4107
Location 4107
dtype: int64
0.1.1 Full data preparation
By Alyssa Shou
# convert the Date column to datetime object
raw_crime_data.Date.dtypesdtype('O')
# converting the date column to datetime objects #
raw_crime_data['Date'] = raw_crime_data['Date'].apply(pd.to_datetime)
type(raw_crime_data.Date[0])pandas._libs.tslibs.timestamps.Timestamp
0.1.2 Crime type preparation
By Alyssa Shou
Since there are 881 NA values in Location Description, I had to drop them because the location of the crimes is necessary for my analysis. No other changes were made to other columns before I began slicing the data I specifically needed for EDA.
# Dropping a very small percentage of the data so general trends should not change
881/2390430.0036855293817430338
time_data = raw_crime_data.dropna(subset = ['Location Description'])
time_data.shape(238162, 22)
0.1.3 Theft data preparation
By Grace Chang
The following data preparation steps served to subset theft data and subsequently merge it with community area data in order to run further analyses on theft occurrence across community areas.
Since the community areas in the original dataset are labeled by their numeric code, I merged it with another dataset to pair these codes with their corresponding community names.
In the original dataset, numeric codes needed to be converted to double digits so that the merge could take place. All observations of “The Loop” were also cleaned so that their format was consistent with the rest of the community area names.
theft_data = raw_crime_data.loc[(raw_crime_data['Primary Type'] == 'THEFT'), :]community_areas = pd.read_html('https://en.wikipedia.org/wiki/Community_areas_in_Chicago', match = 'Chicago community areas by number, population, and area')
community_areas = community_areas[0]
community_areas.columns = [col[0] for col in community_areas.columns]
community_areas.columns = ["No.", "Community Name", "Population", "Area (sqmi)", "Area (sqkm)", "Density (sqmi)", "Density (sqkm)"]
community_areas.head() # Formatting the Community Areas dataset to prepare it for merging| No. | Community Name | Population | Area (sqmi) | Area (sqkm) | Density (sqmi) | Density (sqkm) | |
|---|---|---|---|---|---|---|---|
| 0 | 01 | Rogers Park | 55628 | 1.84 | 4.77 | 30232.61 | 11672.81 |
| 1 | 02 | West Ridge | 77122 | 3.53 | 9.14 | 21847.59 | 8435.36 |
| 2 | 03 | Uptown | 57182 | 2.32 | 6.01 | 24647.41 | 9516.37 |
| 3 | 04 | Lincoln Square | 40494 | 2.56 | 6.63 | 15817.97 | 6107.32 |
| 4 | 05 | North Center | 35114 | 2.05 | 5.31 | 17128.78 | 6613.42 |
theft_data.loc[:,'Community Area'] = theft_data['Community Area'].apply(lambda x: f'{int(x):02d}' if isinstance(x, (int, str)) and str(x).isdigit() else x)
theft_data['Community Area'].value_counts().head(6) # Checking that the single digit codes were indeed converted to double digit format08 4251
32 3464
28 3373
24 2681
06 2268
25 1823
Name: Community Area, dtype: int64
theft_data.loc[:,'Community Area'] = theft_data['Community Area'].astype(str)
theft_community_data = theft_data.merge(community_areas, left_on = 'Community Area', right_on = 'No.', how = 'left')
theft_community_data['Community Name'] = theft_community_data['Community Name'].apply(lambda x: x.replace('(The) Loop[11]', 'The Loop'))
theft_community_data.head(3) # The newly merged dataset with cleaned community names| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Latitude | Longitude | Location | No. | Community Name | Population | Area (sqmi) | Area (sqkm) | Density (sqmi) | Density (sqkm) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13209277 | JG422539 | 08/15/2022 09:00:00 AM | 0000X W JACKSON BLVD | 0810 | THEFT | OVER $500 | COMMERCIAL / BUSINESS OFFICE | False | False | ... | NaN | NaN | NaN | 32 | The Loop | 42298 | 1.65 | 4.27 | 25635.15 | 9897.73 |
| 1 | 12678035 | JF216671 | 04/22/2022 11:30:00 PM | 044XX N BROADWAY | 0810 | THEFT | OVER $500 | APARTMENT | True | True | ... | 41.962247 | -87.65546 | (41.962247182, -87.6554598) | 03 | Uptown | 57182 | 2.32 | 6.01 | 24647.41 | 9516.37 |
| 2 | 12713141 | JF259485 | 05/29/2022 12:00:00 AM | 014XX S MUSEUM CAMPUS DR | 0870 | THEFT | POCKET-PICKING | SPORTS ARENA / STADIUM | False | False | ... | NaN | NaN | NaN | 33 | Near South Side | 28795 | 1.78 | 4.61 | 16176.97 | 6245.93 |
3 rows × 29 columns
By Ryu Kimiko
The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:
Since we need are analyzing house price, we derived some new variables (from existing variables) that intuitively seem to be associated with house price.
We have created a standardized version of the dataset, as we are computing Euclidean distances to find houses similar to a given house
0.1.4 Arrest data preparation
By Paisley Lucier
The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:
Since I will be considering arrest proportion by district, I will look at the distribution of district.
We have created a standardized version of the dataset, as we are computing Euclidean distances to find houses similar to a given house
raw_crime_data.District.value_counts()8 14811
6 14709
12 14353
4 13948
11 13474
1 13058
18 12444
19 12259
25 11955
3 11937
2 11799
9 10375
7 10323
10 9970
5 9785
16 9242
24 8474
15 8310
14 8166
22 7674
17 6996
20 4967
31 14
Name: District, dtype: int64
14 / raw_crime_data.shape[0]5.856686872236376e-05
# making a new column that bins the data into 'side' of Chicago--source:
arrest_data = raw_crime_data.copy()
arrest_data['Side'] = arrest_data.District
arrest_data['Side'] = arrest_data['Side'].map({4: 'South', 5: 'South', 6: 'South', 7: 'South', 22: 'South', 31: 'South', \
11: "North", 14: 'North', 15: 'North', 16: 'North', 17: 'North', 19: 'North', 20: 'North',\
24: 'North', 25: 'North', 1: 'Central', 2: 'Central', 3: 'Central', 8: 'Central', \
9: 'Central', 10: 'Central', 12: 'Central', 13: 'Central', 18: 'Central'})
arrest_data.head(3)| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | Side | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12589893 | JF109865 | 2022-01-11 15:00:00 | 087XX S KINGSTON AVE | 1565 | SEX OFFENSE | INDECENT SOLICITATION OF A CHILD | RESIDENCE | False | True | ... | 46 | 17 | 1194660.0 | 1847481.0 | 2022 | 09/14/2023 03:41:59 PM | 41.736409 | -87.562410 | (41.736409029, -87.562410309) | South |
| 1 | 12592454 | JF113025 | 2022-01-14 15:55:00 | 067XX S MORGAN ST | 2826 | OTHER OFFENSE | HARASSMENT BY ELECTRONIC MEANS | RESIDENCE | False | True | ... | 68 | 26 | 1170805.0 | 1860170.0 | 2022 | 09/14/2023 03:41:59 PM | 41.771782 | -87.649437 | (41.771782439, -87.649436929) | South |
| 2 | 12785595 | JF346553 | 2022-08-05 21:00:00 | 072XX S UNIVERSITY AVE | 1544 | SEX OFFENSE | SEXUAL EXPLOITATION OF A CHILD | APARTMENT | True | False | ... | 69 | 17 | 1185135.0 | 1857211.0 | 2022 | 09/14/2023 03:41:59 PM | 41.763338 | -87.597001 | (41.763337967, -87.597001131) | Central |
3 rows × 23 columns
# considering that district 31 has only 14 observations (which is 5.856686872236376e-03% of the observations),
#and is split between the North and South side on the source map, I considered only the subset of crimes where
# the district was not 31
arrest_data = arrest_data.loc[arrest_data['District'] != 31, :]# read sentiments data
sentiments = pd.read_csv('Police_Sentiment_Scores.csv')
# convert start date to datetime, subset only the values from 2022
sentiments['START_DATE'] = pd.to_datetime(sentiments['START_DATE'])
sentiments['START_DATE'] = sentiments['START_DATE'].map(lambda x: True if x.year == 2022 else False)
sentiments_2022 = sentiments.loc[sentiments['START_DATE'] == True, :]
# also subset only the general columns
sentiments_2022 = sentiments_2022.loc[:, ['DISTRICT', 'SAFETY', 'TRUST', 'T_RESPECT', 'T_LISTEN']]
sentiments_2022| DISTRICT | SAFETY | TRUST | T_RESPECT | T_LISTEN | |
|---|---|---|---|---|---|
| 0 | 20.0 | 61.02 | 57.79 | 57.96 | 57.61 |
| 3 | 8.0 | 58.39 | 62.83 | 64.46 | 61.20 |
| 9 | 25.0 | 67.50 | 58.93 | 60.27 | 57.59 |
| 16 | 16.0 | 64.37 | 71.46 | 73.37 | 69.55 |
| 25 | 25.0 | 62.51 | 65.83 | 66.45 | 65.22 |
| ... | ... | ... | ... | ... | ... |
| 7248 | 11.0 | 53.02 | 47.70 | 48.64 | 46.76 |
| 7251 | 19.0 | 59.55 | 64.68 | 69.81 | 59.55 |
| 7254 | 12.0 | 58.63 | 57.17 | 59.86 | 54.49 |
| 7256 | 20.0 | 62.12 | 60.23 | 61.71 | 58.75 |
| 7260 | 17.0 | 64.77 | 64.76 | 65.45 | 64.08 |
1164 rows × 5 columns
0.1.5 CTA data preparation
By Grace Shao
Implementing the following steps allowed for missing location data to be removed and community area names to be included with their corresponding numbers for enhanced readibility and analysis.
- After subsetting the data to include only crimes that occurred at CTA stations, I removed observations with missing longitude and latitude values. Location data will be used in my data analysis, and removing NA values ensures a clean dataset. Since only around 1% of the CTA data was removed, this step does not have a significant impact on my conclusions.
- Next, I added the names of the community areas. Originally, the column in the raw dataset included the numeric codes of the areas, but not their names. I merged my data with a Wikipedia table that included both to aid in my analysis.
- After merging, all observations of “The Loop” have extra characters at the end. Removing those extra characters ensured that it matched the other community areas’ names.
crimesCTA = raw_crime_data.loc[raw_crime_data['Location Description'].apply(lambda x: 'CTA' in str(x) and 'CTA BUS' not in str(x)
and 'CTA PARKING LOT / GARAGE / OTHER PROPERTY' not in str(x)), :]
# Drop NA location values (longitude and latitude)
numberRows = crimesCTA.shape[0]
numberLongitudeNA = crimesCTA['Longitude'].isna().sum()
numberLatitudeNA = crimesCTA['Latitude'].isna().sum()
print("Number of missing longitudes:", numberLongitudeNA, "Number of missing latitudes:", numberLatitudeNA)
print("Percentage of missing values out of all values is", numberLongitudeNA/numberRows*100, "%")
crimesCTA.dropna(subset=['Longitude'], inplace=True)
crimesCTA.dropna(subset=['Latitude'], inplace=True)Number of missing longitudes: 31 Number of missing latitudes: 31
Percentage of missing values out of all values is 1.2684124386252045 %
# Add a column with the names of the community areas instead of just the number
communityAreas = pd.read_html('https://en.wikipedia.org/wiki/Community_areas_in_Chicago', match = 'Chicago community areas by number, population, and area')
communityAreas = communityAreas[0]
communityAreas.columns = [col[0] for col in communityAreas.columns]
communityAreas = communityAreas.iloc[:-1, [0,1]]
communityAreas.rename(columns={'No.': 'Community Area'}, inplace=True)
communityAreas.head()| Community Area | Name | |
|---|---|---|
| 0 | 01 | Rogers Park |
| 1 | 02 | West Ridge |
| 2 | 03 | Uptown |
| 3 | 04 | Lincoln Square |
| 4 | 05 | North Center |
crimesCTA['Community Area'] = crimesCTA['Community Area'].astype('int64')
communityAreas['Community Area'] = communityAreas['Community Area'].astype('int64')
crimesCTA = pd.merge(crimesCTA, communityAreas, on='Community Area', how='inner')
crimesCTA['Name'] = crimesCTA['Name'].apply(lambda x: x.replace('(The) Loop[11]', 'The Loop'))# Check to make sure the names are accurate
crimesCTA.loc[:, ['Community Area', 'Name']].iloc[-5:]| Community Area | Name | |
|---|---|---|
| 2408 | 70 | Ashburn |
| 2409 | 23 | Humboldt Park |
| 2410 | 9 | Edison Park |
| 2411 | 41 | Hyde Park |
| 2412 | 66 | Chicago Lawn |
0.2 Set graph settings
colors = ["#c77464", "#2396be", "#967aa0", "#ffc858", "#d8a2e7", "#89c388", "#e6a1c6", "#6c6bc9", "#82c1c4", "#d98c55"]
sns.set_palette(sns.color_palette(colors))0.3 Exploratory data analysis
Put code with comments. The comments should explain the code such that it can be easily understood. You may put text (in a markdown cell) before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. Put the name of the person / persons who contributed to each code chunk / set of code chunks.
0.3.1 Analysis 1
By <Alyssa Shou>
Graphing the number of crimes by hour as an initial starting point to determine analysis time frames
# Round times down to nearest hour
time_data['hourly_time'] = time_data['Date'].dt.floor('H').dt.time
crime_counts = time_data['hourly_time'].value_counts().sort_index()
crime_counts = crime_counts.to_frame(name = 'Count').reset_index()
plt.figure(figsize=(10, 6))
bx = sns.lineplot(x= crime_counts.index, y=crime_counts.Count, color = "#2396be")
step = len(crime_counts.index) // 24
plt.xticks(range(0, len(crime_counts.index), step))
plt.xlabel('Time (Hourly)', fontsize = 15, labelpad=10)
plt.ylabel('Number of Crimes', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Hour', fontsize = 20)
plt.tight_layout()
plt.show()AttributeError: Can only use .dt accessor with datetimelike values
0.4 Morning Rush
#slicing the data to only look at 6 am to 9 am
morning_rush = time_data.loc[(time_data['Date'].dt.hour >= 6) & (time_data['Date'].dt.hour < 9)]0.4.1 Morning Rush Crime Types
#top 10 types of crime from 6 am to 9 am
combination_mr = morning_rush.groupby(['Primary Type', 'Description']).size().reset_index(name='Count')
combination_mr = combination_mr.sort_values(by = 'Count', ascending = False).head(10)
combination_mr['Type'] = combination_mr['Primary Type'] + " " + combination_mr['Description']
combination_mr| Primary Type | Description | Count | Type | |
|---|---|---|---|---|
| 205 | THEFT | OVER $500 | 1710 | THEFT OVER $500 |
| 201 | THEFT | $500 AND UNDER | 1601 | THEFT $500 AND UNDER |
| 36 | BATTERY | DOMESTIC BATTERY SIMPLE | 1576 | BATTERY DOMESTIC BATTERY SIMPLE |
| 103 | MOTOR VEHICLE THEFT | AUTOMOBILE | 1504 | MOTOR VEHICLE THEFT AUTOMOBILE |
| 52 | CRIMINAL DAMAGE | TO VEHICLE | 1312 | CRIMINAL DAMAGE TO VEHICLE |
| 16 | ASSAULT | SIMPLE | 1055 | ASSAULT SIMPLE |
| 50 | CRIMINAL DAMAGE | TO PROPERTY | 1022 | CRIMINAL DAMAGE TO PROPERTY |
| 39 | BATTERY | SIMPLE | 890 | BATTERY SIMPLE |
| 208 | THEFT | RETAIL THEFT | 692 | THEFT RETAIL THEFT |
| 41 | BURGLARY | FORCIBLE ENTRY | 396 | BURGLARY FORCIBLE ENTRY |
#barplot of crime types
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= combination_mr.Type, y = combination_mr.Count, palette = colors)
plt.xlabel('Crime', fontsize = 15, labelpad=10)
ax.set_xticklabels(labels = ['> $500 Theft', '< $500 Theft', 'Simple Domestic Battery', 'Automobile Theft',
'Vehicle Damage', 'Simple Assault', 'Property Damage', 'Simple Battery',
'Retail Theft', 'Forced Burglary'])
plt.ylabel('Number of crimes', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Type (Morning Rush)', fontsize = 20)
plt.xticks(rotation=45, weight = 'bold')
plt.tight_layout()
plt.show()0.4.2 Morning Rush Locations
mr_locations = morning_rush['Location Description'].value_counts().head(10)
locations_mr = mr_locations.to_frame(name = 'Count').reset_index()
locations_mr| index | Count | |
|---|---|---|
| 0 | STREET | 4989 |
| 1 | APARTMENT | 3990 |
| 2 | RESIDENCE | 2448 |
| 3 | PARKING LOT / GARAGE (NON RESIDENTIAL) | 765 |
| 4 | SIDEWALK | 714 |
| 5 | SMALL RETAIL STORE | 431 |
| 6 | COMMERCIAL / BUSINESS OFFICE | 305 |
| 7 | ALLEY | 301 |
| 8 | OTHER (SPECIFY) | 290 |
| 9 | RESIDENCE - GARAGE | 283 |
locations_mr['Count'].sum()14516
#percent of top 10 crimes in public locations
(4989+765+714+431+301)*100/1451649.60044089280794
#barplot of crime locations
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= locations_mr.index, y = locations_mr.Count, palette = colors)
plt.xlabel('Location', fontsize = 17, labelpad=10)
ax.set_xticklabels(labels = ['Street', 'Apartment', 'Residence', 'Public Garage/Lot',
'Sidewalk', 'Small Retail Store', 'Office', 'Alley', 'Other',
'Residence Garage'])
plt.ylabel('Count', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Location (Morning Rush)', fontsize = 20)
plt.xticks(rotation=45, weight = 'bold', fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.show()0.4.3 Morning Rush Violence
#IUCR code <= 558 + codes 051A, 051B, 041A, 041B, 031A, 031B are considered violent
# Function to determine if a crime is violent based on IUCR code
def is_violent(iucr_code):
# Define extra IUCR codes that indicate violent crimes
violent_iucr = ['051A', '051B', '041A', '041B', '031A', '031B']
if iucr_code.isdigit():
return int(iucr_code) <= 558
else:
return iucr_code in violent_iucr
morning_rush['Violent?'] = morning_rush['IUCR'].apply(lambda x: is_violent(x))
morning_rush.head() #check that column is added| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | hourly_time | Violent? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 12829768 | JF398130 | 2022-09-16 07:00:00 | 030XX N WHIPPLE ST | 0320 | ROBBERY | STRONG ARM - NO WEAPON | APARTMENT | False | False | ... | 03 | 1155521.0 | 1919887.0 | 2022 | 09/15/2023 03:41:25 PM | 41.935972 | -87.703859 | (41.935972475, -87.703858758) | 07:00:00 | True |
| 18 | 13211226 | JF308371 | 2022-06-07 07:00:00 | 002XX S ASHLAND AVE | 5000 | OTHER OFFENSE | OTHER CRIME AGAINST PERSON | GOVERNMENT BUILDING / PROPERTY | False | False | ... | 26 | NaN | NaN | 2022 | 09/15/2023 03:42:23 PM | NaN | NaN | NaN | 07:00:00 | False |
| 28 | 12784904 | JF345855 | 2022-07-31 07:00:00 | 021XX N SPRINGFIELD AVE | 1754 | OFFENSE INVOLVING CHILDREN | AGGRAVATED SEXUAL ASSAULT OF CHILD BY FAMILY M... | APARTMENT | False | True | ... | 02 | 1150028.0 | 1914097.0 | 2022 | 09/16/2023 03:41:56 PM | 41.920193 | -87.724197 | (41.920193109, -87.724197302) | 07:00:00 | False |
| 31 | 12931878 | JF521379 | 2022-12-24 08:00:00 | 021XX S MARSHALL BLVD | 0498 | BATTERY | AGG. DOMESTIC BATTERY - HANDS, FISTS, FEET, SE... | RESIDENCE - PORCH / HALLWAY | True | True | ... | 04B | 1157031.0 | 1889686.0 | 2022 | 09/22/2023 03:41:26 PM | 41.853068 | -87.699130 | (41.853067647, -87.69912978) | 08:00:00 | True |
| 39 | 13213753 | JG427020 | 2022-01-01 06:00:00 | 014XX N SEDGWICK ST | 1130 | DECEPTIVE PRACTICE | FRAUD OR CONFIDENCE GAME | APARTMENT | False | False | ... | 11 | NaN | NaN | 2022 | 09/18/2023 03:42:32 PM | NaN | NaN | NaN | 06:00:00 | False |
5 rows × 24 columns
violent_crimes_percentage = (morning_rush['Violent?'].sum() / len(morning_rush)) * 100
print(f"The percentage of violent crimes is: {violent_crimes_percentage:.2f}%")The percentage of violent crimes is: 23.69%
0.5 Noon
noon = time_data.loc[time_data['Date'].dt.hour == 12]0.5.1 Noon Crime Types
combination_noon = noon.groupby(['Primary Type', 'Description']).size().reset_index(name='Count')
combination_noon = combination_noon.sort_values(by = 'Count', ascending = False).head(10)
combination_noon['Type'] = combination_noon['Primary Type'] + " " + combination_noon['Description']
combination_noon| Primary Type | Description | Count | Type | |
|---|---|---|---|---|
| 203 | THEFT | $500 AND UNDER | 1219 | THEFT $500 AND UNDER |
| 207 | THEFT | OVER $500 | 1102 | THEFT OVER $500 |
| 34 | BATTERY | SIMPLE | 799 | BATTERY SIMPLE |
| 16 | ASSAULT | SIMPLE | 779 | ASSAULT SIMPLE |
| 32 | BATTERY | DOMESTIC BATTERY SIMPLE | 724 | BATTERY DOMESTIC BATTERY SIMPLE |
| 96 | MOTOR VEHICLE THEFT | AUTOMOBILE | 696 | MOTOR VEHICLE THEFT AUTOMOBILE |
| 210 | THEFT | RETAIL THEFT | 642 | THEFT RETAIL THEFT |
| 44 | CRIMINAL DAMAGE | TO PROPERTY | 516 | CRIMINAL DAMAGE TO PROPERTY |
| 46 | CRIMINAL DAMAGE | TO VEHICLE | 500 | CRIMINAL DAMAGE TO VEHICLE |
| 67 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | 430 | DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OV... |
#barplot of crime types
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= combination_noon.Type, y = combination_noon.Count, palette = colors)
plt.xlabel('Crime', fontsize = 15, labelpad=10)
ax.set_xticklabels(labels = ['< $500 Theft', '> $500 Theft', 'Simple Battery', 'Simple Assault',
'Simple Domestic Battery', 'Automobile Theft', 'Retail Theft', 'Property Damage',
'Vehicle Damage', '> $300 Identity Theft'])
plt.ylabel('Number of crimes', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Type (Noon)', fontsize = 20)
plt.xticks(rotation=45, weight = 'bold')
plt.tight_layout()
plt.show()0.5.2 Noon Locations
noon_locations = noon['Location Description'].value_counts().head(10)
locations_noon = noon_locations.to_frame(name = 'Count').reset_index()
locations_noon| index | Count | |
|---|---|---|
| 0 | APARTMENT | 2840 |
| 1 | STREET | 2729 |
| 2 | RESIDENCE | 2342 |
| 3 | SIDEWALK | 627 |
| 4 | SMALL RETAIL STORE | 576 |
| 5 | PARKING LOT / GARAGE (NON RESIDENTIAL) | 488 |
| 6 | COMMERCIAL / BUSINESS OFFICE | 327 |
| 7 | OTHER (SPECIFY) | 319 |
| 8 | RESTAURANT | 284 |
| 9 | SCHOOL - PUBLIC BUILDING | 246 |
locations_noon['Count'].sum()10778
#percent of top 10 crimes in public locations
(2729+627+576+488+284+246)*100/1077845.92688810539989
#barplot of crime locations
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= locations_noon.index, y = locations_noon.Count, palette = colors)
plt.xlabel('Location', fontsize = 17, labelpad=10)
ax.set_xticklabels(labels = ['Apartment', 'Street', 'Residence', 'Sidewalk',
'Small Retail Store', 'Public Garage/Lot', 'Office', 'Other',
'Restaurant', 'School'])
plt.ylabel('Count', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Location (Noon)', fontsize = 20)
plt.xticks(rotation=45, weight = 'bold', fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.show()0.5.3 Noon Violence
noon['Violent?'] = noon['IUCR'].apply(lambda x: is_violent(x))
noon.head() #check that column is added| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | hourly_time | Violent? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 12885646 | JF465749 | 2022-11-05 12:49:00 | 044XX W RICE ST | 0620 | BURGLARY | UNLAWFUL ENTRY | WAREHOUSE | True | False | ... | 05 | 1146532.0 | 1905297.0 | 2022 | 09/15/2023 03:41:25 PM | 41.896112 | -87.737267 | (41.896112422, -87.737267038) | 12:00:00 | False |
| 27 | 12767058 | JF324850 | 2022-07-03 12:00:00 | 019XX N LOWELL AVE | 1754 | OFFENSE INVOLVING CHILDREN | AGGRAVATED SEXUAL ASSAULT OF CHILD BY FAMILY M... | APARTMENT | False | True | ... | 02 | 1147084.0 | 1912456.0 | 2022 | 09/16/2023 03:41:56 PM | 41.915747 | -87.735056 | (41.915746919, -87.735056247) | 12:00:00 | False |
| 41 | 13184937 | JG393538 | 2022-08-01 12:00:00 | 020XX W HOMER ST | 1120 | DECEPTIVE PRACTICE | FORGERY | APARTMENT | False | True | ... | 10 | NaN | NaN | 2022 | 08/23/2023 03:41:55 PM | NaN | NaN | NaN | 12:00:00 | False |
| 55 | 12792072 | JF354522 | 2022-08-12 12:00:00 | 032XX N DAMEN AVE | 0620 | BURGLARY | UNLAWFUL ENTRY | APARTMENT | True | False | ... | 05 | 1162437.0 | 1921720.0 | 2022 | 09/20/2023 03:41:26 PM | 41.940860 | -87.678390 | (41.940860174, -87.678390398) | 12:00:00 | False |
| 57 | 12616294 | JF142268 | 2022-02-12 12:46:00 | 054XX S LOOMIS BLVD | 031A | ROBBERY | ARMED - HANDGUN | STREET | True | False | ... | 03 | 1167920.0 | 1868745.0 | 2022 | 11/13/2023 03:41:12 PM | 41.795376 | -87.659766 | (41.795375753, -87.659766174) | 12:00:00 | True |
5 rows × 24 columns
violent_crimes_percentage = (noon['Violent?'].sum() / len(noon)) * 100
print(f"The percentage of violent crimes is: {violent_crimes_percentage:.2f}%")The percentage of violent crimes is: 18.88%
0.6 Evening Rush Hour
evening_rush = time_data.loc[(time_data['Date'].dt.hour >= 16) & (time_data['Date'].dt.hour < 19)]0.6.1 Evening Rush Crime Types
combination_er = evening_rush.groupby(['Primary Type', 'Description']).size().reset_index(name='Count')
combination_er = combination_er.sort_values(by = 'Count', ascending = False).head(10)
combination_er['Type'] = combination_er['Primary Type'] + " " + combination_er['Description']
combination_er| Primary Type | Description | Count | Type | |
|---|---|---|---|---|
| 237 | THEFT | OVER $500 | 3287 | THEFT OVER $500 |
| 232 | THEFT | $500 AND UNDER | 3178 | THEFT $500 AND UNDER |
| 115 | MOTOR VEHICLE THEFT | AUTOMOBILE | 2785 | MOTOR VEHICLE THEFT AUTOMOBILE |
| 37 | BATTERY | DOMESTIC BATTERY SIMPLE | 2563 | BATTERY DOMESTIC BATTERY SIMPLE |
| 17 | ASSAULT | SIMPLE | 2511 | ASSAULT SIMPLE |
| 40 | BATTERY | SIMPLE | 2328 | BATTERY SIMPLE |
| 53 | CRIMINAL DAMAGE | TO VEHICLE | 2240 | CRIMINAL DAMAGE TO VEHICLE |
| 240 | THEFT | RETAIL THEFT | 1995 | THEFT RETAIL THEFT |
| 51 | CRIMINAL DAMAGE | TO PROPERTY | 1680 | CRIMINAL DAMAGE TO PROPERTY |
| 235 | THEFT | FROM BUILDING | 871 | THEFT FROM BUILDING |
#barplot of crime types
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= combination_er.Type, y = combination_er.Count, palette = colors)
plt.xlabel('Crime', fontsize = 15, labelpad=10)
ax.set_xticklabels(labels = ['> $500 Theft', '< $500 Theft', 'Automobile Theft', 'Simple Domestic Battery',
'Simple Assault', 'Simple Battery', 'Vehicle Damage',
'Retail Theft', 'Property Damage', 'Theft (From Building)'])
plt.ylabel('Number of crimes', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Type (Evening Rush)', fontsize = 20)
plt.xticks(rotation=45, weight = 'bold')
plt.tight_layout()
plt.show()0.6.2 Evening Rush Locations
er_locations = evening_rush['Location Description'].value_counts().head(10)
locations_er = er_locations.to_frame(name = 'Count').reset_index()
locations_er| index | Count | |
|---|---|---|
| 0 | STREET | 10651 |
| 1 | APARTMENT | 6168 |
| 2 | RESIDENCE | 4289 |
| 3 | SIDEWALK | 1982 |
| 4 | PARKING LOT / GARAGE (NON RESIDENTIAL) | 1717 |
| 5 | SMALL RETAIL STORE | 1679 |
| 6 | RESTAURANT | 993 |
| 7 | DEPARTMENT STORE | 856 |
| 8 | ALLEY | 759 |
| 9 | OTHER (SPECIFY) | 555 |
locations_er['Count'].sum()29649
#Percentage of top 10 crimes in public locations
(10651+1982+1717+1679+993+856+759)*100/2964962.85878107187426
#barplot of crime locations
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= locations_er.index, y = locations_er.Count, palette = colors)
plt.xlabel('Location', fontsize = 17, labelpad=10)
ax.set_xticklabels(labels = ['Street', 'Apartment', 'Residence', 'Sidewalk', 'Public Garage/Lot',
'Small Retail Store', 'Restaurant', 'Department Store', 'Alley', 'Other'])
plt.ylabel('Count', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Location (Evening Rush)', fontsize = 20)
plt.xticks(rotation=45, weight = 'bold', fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.show()evening_rush['Violent?'] = evening_rush['IUCR'].apply(lambda x: is_violent(x))
evening_rush.head() #check that column is added| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | hourly_time | Violent? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 12893219 | JF474959 | 2022-11-14 17:00:00 | 012XX S INDEPENDENCE BLVD | 0325 | ROBBERY | VEHICULAR HIJACKING | STREET | True | False | ... | 03 | 1151289.0 | 1894080.0 | 2022 | 09/15/2023 03:41:25 PM | 41.865240 | -87.720090 | (41.865239745, -87.720089778) | 17:00:00 | True |
| 17 | 13210651 | JG424030 | 2022-12-29 17:00:00 | 026XX W 19TH ST | 1563 | SEX OFFENSE | CRIMINAL SEXUAL ABUSE | OTHER (SPECIFY) | False | False | ... | 17 | 1159021.0 | 1890619.0 | 2022 | 11/15/2023 03:41:31 PM | 41.855587 | -87.691800 | (41.855587334, -87.691800214) | 17:00:00 | False |
| 19 | 12928785 | JF517668 | 2022-12-20 16:30:00 | 076XX S MAPLEWOOD AVE | 1020 | ARSON | BY FIRE | RESIDENCE - GARAGE | True | False | ... | 09 | 1160707.0 | 1853794.0 | 2022 | 08/23/2023 03:40:54 PM | 41.754500 | -87.686629 | (41.754500145, -87.686628806) | 16:00:00 | False |
| 22 | 12602374 | JF125017 | 2022-01-26 17:45:00 | 069XX N CLARK ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | ... | 08B | 1163566.0 | 1945922.0 | 2022 | 09/16/2023 03:41:56 PM | 42.007248 | -87.673556 | (42.007247653, -87.673555844) | 17:00:00 | True |
| 29 | 12812832 | JF378884 | 2022-06-23 17:35:00 | 026XX N MOODY AVE | 0281 | CRIMINAL SEXUAL ASSAULT | NON-AGGRAVATED | APARTMENT | False | True | ... | 02 | 1134954.0 | 1916928.0 | 2022 | 09/16/2023 03:41:56 PM | 41.928242 | -87.779516 | (41.928242161, -87.779515528) | 17:00:00 | True |
5 rows × 24 columns
violent_crimes_percentage = (evening_rush['Violent?'].sum() / len(evening_rush)) * 100
print(f"The percentage of violent crimes is: {violent_crimes_percentage:.2f}%")The percentage of violent crimes is: 23.93%
0.7 Midnight
midnight = time_data.loc[time_data['Date'].dt.hour == 0]combination_midnight = midnight.groupby(['Primary Type', 'Description']).size().reset_index(name='Count')
combination_midnight = combination_midnight.sort_values(by = 'Count', ascending = False).head(10)
combination_midnight['Type'] = combination_midnight['Primary Type'] + " " + combination_midnight['Description']
combination_midnight| Primary Type | Description | Count | Type | |
|---|---|---|---|---|
| 210 | THEFT | OVER $500 | 1738 | THEFT OVER $500 |
| 205 | THEFT | $500 AND UNDER | 1564 | THEFT $500 AND UNDER |
| 33 | BATTERY | DOMESTIC BATTERY SIMPLE | 1503 | BATTERY DOMESTIC BATTERY SIMPLE |
| 48 | CRIMINAL DAMAGE | TO VEHICLE | 1323 | CRIMINAL DAMAGE TO VEHICLE |
| 108 | MOTOR VEHICLE THEFT | AUTOMOBILE | 1317 | MOTOR VEHICLE THEFT AUTOMOBILE |
| 46 | CRIMINAL DAMAGE | TO PROPERTY | 1104 | CRIMINAL DAMAGE TO PROPERTY |
| 35 | BATTERY | SIMPLE | 882 | BATTERY SIMPLE |
| 15 | ASSAULT | SIMPLE | 732 | ASSAULT SIMPLE |
| 74 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | 708 | DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OV... |
| 75 | DECEPTIVE PRACTICE | FORGERY | 466 | DECEPTIVE PRACTICE FORGERY |
#barplot of crime types
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= combination_midnight.Type, y = combination_midnight.Count, palette = colors)
plt.xlabel('Crime', fontsize = 15, labelpad=10)
ax.set_xticklabels(labels = ['> $500 Theft', '< $500 Theft', 'Simple Domestic Battery', 'Vehicle Damage',
'Automobile Theft', 'Property Damage', 'Simple Battery', 'Simple Assault',
'> $300 Identity Theft', 'Forgery'])
plt.ylabel('Number of crimes', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Type (Midnight)', fontsize = 20)
plt.xticks(rotation=45, weight = 'bold')
#plt.xticks(x, weight = 'bold')
plt.tight_layout()
plt.show()0.7.1 Midnight Locations
midnight_locations = midnight['Location Description'].value_counts().head(10)
locations_midnight = midnight_locations.to_frame(name = 'Count').reset_index()
locations_midnight| index | Count | |
|---|---|---|
| 0 | STREET | 5613 |
| 1 | APARTMENT | 4658 |
| 2 | RESIDENCE | 3744 |
| 3 | SIDEWALK | 794 |
| 4 | PARKING LOT / GARAGE (NON RESIDENTIAL) | 615 |
| 5 | OTHER (SPECIFY) | 432 |
| 6 | ALLEY | 404 |
| 7 | COMMERCIAL / BUSINESS OFFICE | 404 |
| 8 | SMALL RETAIL STORE | 302 |
| 9 | RESTAURANT | 269 |
locations_midnight['Count'].sum()17235
#Percentage of top 10 crimes in public locations
(5613+794+615+404+302+269)*100/1723546.399767914128226
#barplot of crime locations
plt.figure(figsize=(12, 6))
ax = sns.barplot(x= locations_midnight.index, y = locations_midnight.Count, palette = colors)
#ax = sns.barplot(x= locations_midnight.Count, y = locations_midnight.index, palette = colors, orient = "h")
plt.xlabel('Location', fontsize = 17, labelpad=10)
ax.set_xticklabels(labels = ['Street', 'Apartment', 'Residence', 'Sidewalk',
'Public Garage/Lot', 'Other', 'Alley', 'Office',
'Small Retail Store', 'Restaurant'])
plt.ylabel('Count', fontsize = 15, labelpad=10)
plt.title('Number of Crimes by Location (Midnight)', fontsize = 20)
plt.xticks(rotation=45, weight = "bold", fontsize = 13)
plt.yticks(fontsize = 13)
plt.tight_layout()
plt.show()0.7.2 Midnight Violence
midnight['Violent?'] = midnight['IUCR'].apply(lambda x: is_violent(x))
midnight.head()| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | hourly_time | Violent? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 12735680 | JF286563 | 2022-06-20 00:50:00 | 085XX S PEORIA ST | 041A | BATTERY | AGGRAVATED - HANDGUN | SCHOOL - PUBLIC GROUNDS | False | False | ... | 04B | 1171804.0 | 1848175.0 | 2022 | 09/15/2023 03:41:25 PM | 41.738845 | -87.646126 | (41.738844769, -87.646126146) | 00:00:00 | True |
| 20 | 12713141 | JF259485 | 2022-05-29 00:00:00 | 014XX S MUSEUM CAMPUS DR | 0870 | THEFT | POCKET-PICKING | SPORTS ARENA / STADIUM | False | False | ... | 06 | NaN | NaN | 2022 | 06/05/2022 03:40:47 PM | NaN | NaN | NaN | 00:00:00 | False |
| 32 | 13211560 | JG425458 | 2022-02-26 00:00:00 | 065XX N FRANCISCO AVE | 1120 | DECEPTIVE PRACTICE | FORGERY | APARTMENT | False | False | ... | 10 | NaN | NaN | 2022 | 09/16/2023 03:42:58 PM | NaN | NaN | NaN | 00:00:00 | False |
| 34 | 12826755 | JF395464 | 2022-08-05 00:00:00 | 036XX W 58TH PL | 1752 | OFFENSE INVOLVING CHILDREN | AGGRAVATED CRIMINAL SEXUAL ABUSE BY FAMILY MEMBER | RESIDENCE | False | True | ... | 17 | 1153103.0 | 1865576.0 | 2022 | 09/17/2023 03:41:45 PM | 41.786985 | -87.714185 | (41.786985284, -87.714184585) | 00:00:00 | False |
| 35 | 13212633 | JG426702 | 2022-11-26 00:00:00 | 052XX S KEDZIE AVE | 2820 | OTHER OFFENSE | TELEPHONE THREAT | APARTMENT | False | True | ... | 08A | 1155933.0 | 1869765.0 | 2022 | 11/15/2023 03:41:31 PM | 41.798424 | -87.703696 | (41.798424102, -87.703695673) | 00:00:00 | False |
5 rows × 24 columns
# Calculate the percentage of rows where crime is considered violent
violent_crimes_percentage = (midnight['Violent?'].sum() / len(midnight)) * 100
print(f"The percentage of violent crimes is: {violent_crimes_percentage:.2f}%")The percentage of violent crimes is: 21.88%
0.7.3 Analysis 2
By <Grace Chang>
0.7.3.1 Analyzing the top 12 community areas with the most thefts
# Finding the top 12 community areas that experience the most theft
top_12_community_areas = theft_community_data['Community Name'].value_counts().head(12)
top_12_areas_list = top_12_community_areas.index.to_list()
top_12_areas_data = theft_community_data[theft_community_data['Community Name'].isin(top_12_areas_list)]
top_12_areas_data.head(3)| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Latitude | Longitude | Location | No. | Community Name | Population | Area (sqmi) | Area (sqkm) | Density (sqmi) | Density (sqkm) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13209277 | JG422539 | 2022-08-15 09:00:00 | 0000X W JACKSON BLVD | 0810 | THEFT | OVER $500 | COMMERCIAL / BUSINESS OFFICE | False | False | ... | NaN | NaN | NaN | 32 | The Loop | 42298 | 1.65 | 4.27 | 25635.15 | 9897.73 |
| 1 | 12678035 | JF216671 | 2022-04-22 23:30:00 | 044XX N BROADWAY | 0810 | THEFT | OVER $500 | APARTMENT | True | True | ... | 41.962247 | -87.65546 | (41.962247182, -87.6554598) | 03 | Uptown | 57182 | 2.32 | 6.01 | 24647.41 | 9516.37 |
| 10 | 12871603 | JF449024 | 2022-10-23 11:55:00 | 034XX N SOUTHPORT AVE | 0860 | THEFT | RETAIL THEFT | SMALL RETAIL STORE | False | False | ... | NaN | NaN | NaN | 06 | Lake View | 103050 | 3.12 | 8.08 | 33028.85 | 12752.44 |
3 rows × 29 columns
theft_type = pd.crosstab(top_12_areas_data['Community Name'], top_12_areas_data['Description'])
theft_type.head(3) # Making the dataframe suitable for a stacked bar-plot| Description | $500 AND UNDER | ATTEMPT THEFT | DELIVERY CONTAINER THEFT | FROM BUILDING | FROM COIN-OPERATED MACHINE OR DEVICE | OVER $500 | POCKET-PICKING | PURSE-SNATCHING | RETAIL THEFT |
|---|---|---|---|---|---|---|---|---|---|
| Community Name | |||||||||
| Austin | 851 | 15 | 1 | 101 | 0 | 636 | 18 | 23 | 178 |
| Chatham | 456 | 9 | 1 | 67 | 1 | 289 | 16 | 12 | 168 |
| Lake View | 515 | 18 | 0 | 286 | 0 | 725 | 149 | 10 | 565 |
# Color palette
colors = ["#c77464", "#2396be", "#967aa0", "#ffc858", "#d8a2e7", "#89c388", "#e6a1c6", "#6c6bc9", "#82c1c4", "#d98c55"]
sns.set_palette(sns.color_palette(colors))
# Stacked barplot
ax = theft_type.plot.bar(stacked = True, ylabel = 'Number of Thefts', figsize=(12, 8), xlabel = 'Community')
ax.tick_params(axis = 'both', labelsize = 13)
ax.tick_params(axis = 'x', labelrotation = 45)
# Labels for the plot
ax.legend(title = 'Type of Theft', fontsize=10, title_fontsize=13)
ax.set_xlabel('Community', fontsize=16)
ax.set_ylabel('Number of Thefts', fontsize=16)
ax.set_title('Types of theft in each community area', fontsize=20)Text(0.5, 1.0, 'Types of theft in each community area')
0.7.3.2 Analyzing the relationship between Population Density & Number of Thefts in a community area
# Create a dataframe with Community Names and their corresponding Number of Thefts
theft_counts_areas = pd.DataFrame(theft_community_data['Community Name'].value_counts())
theft_counts_areas = theft_counts_areas.reset_index()
theft_counts_areas.columns = ["Community Name", "Number of Thefts"]
theft_counts_areas.head(3) | Community Name | Number of Thefts | |
|---|---|---|
| 0 | Near North Side | 4251 |
| 1 | The Loop | 3464 |
| 2 | Near West Side | 3373 |
# Merge with community_areas dataset that includes the population density
theft_counts_areas['Community Name'] = theft_counts_areas['Community Name'].astype(str)
theft_with_density = theft_counts_areas.merge(community_areas, \
left_on = 'Community Name', right_on = 'Community Name', how = 'left')
theft_with_density.head()| Community Name | Number of Thefts | No. | Population | Area (sqmi) | Area (sqkm) | Density (sqmi) | Density (sqkm) | |
|---|---|---|---|---|---|---|---|---|
| 0 | Near North Side | 4251 | 08 | 105481.0 | 2.74 | 7.10 | 38496.72 | 14863.58 |
| 1 | The Loop | 3464 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Near West Side | 3373 | 28 | 67881.0 | 5.69 | 14.74 | 11929.88 | 4606.13 |
| 3 | West Town | 2681 | 24 | 87781.0 | 4.58 | 11.86 | 19166.16 | 7400.05 |
| 4 | Lake View | 2268 | 06 | 103050.0 | 3.12 | 8.08 | 33028.85 | 12752.44 |
(theft_with_density.sort_values(by = 'Density (sqkm)', ascending = False)).head(12)
# Examining the top 12 community areas by population density (looking for the top 4 theft prone areas within these 12)| Community Name | Number of Thefts | No. | Population | Area (sqmi) | Area (sqkm) | Density (sqmi) | Density (sqkm) | |
|---|---|---|---|---|---|---|---|---|
| 0 | Near North Side | 4251 | 08 | 105481.0 | 2.74 | 7.10 | 38496.72 | 14863.58 |
| 4 | Lake View | 2268 | 06 | 103050.0 | 3.12 | 8.08 | 33028.85 | 12752.44 |
| 15 | Edgewater | 919 | 77 | 56296.0 | 1.74 | 4.51 | 32354.02 | 12491.89 |
| 9 | Rogers Park | 1257 | 01 | 55628.0 | 1.84 | 4.77 | 30232.61 | 11672.81 |
| 38 | Albany Park | 507 | 14 | 48396.0 | 1.92 | 4.97 | 25206.25 | 9732.13 |
| 10 | Uptown | 1255 | 03 | 57182.0 | 2.32 | 6.01 | 24647.41 | 9516.37 |
| 6 | Lincoln Park | 1541 | 07 | 70492.0 | 3.16 | 8.18 | 22307.59 | 8612.96 |
| 12 | West Ridge | 967 | 02 | 77122.0 | 3.53 | 9.14 | 21847.59 | 8435.36 |
| 55 | Hermosa | 287 | 20 | 24062.0 | 1.17 | 3.03 | 20565.81 | 7940.46 |
| 20 | Belmont Cragin | 768 | 19 | 78116.0 | 3.91 | 10.13 | 19978.52 | 7713.71 |
| 7 | Logan Square | 1341 | 22 | 71665.0 | 3.59 | 9.30 | 19962.40 | 7707.48 |
| 3 | West Town | 2681 | 24 | 87781.0 | 4.58 | 11.86 | 19166.16 | 7400.05 |
# Plotting all 77 community areas to view the correlation between number of thefts and population density
plt.rcParams['figure.figsize'] = (9, 6)
sns.set(font_scale = 1.5)
ax = sns.regplot(x = 'Density (sqkm)', y = 'Number of Thefts', data = theft_with_density, scatter_kws = {'color':'#c87464'}, line_kws = {'color':'#2596be'})
# Plot labels
ax.set_title('Number of thefts vs. population density of a community area')
ax.xaxis.set_major_formatter('{x:,.0f}')
ax.yaxis.set_major_formatter('{x:,.0f}')
ax.set_xlabel('Population density (sqkm)')
ax.set_ylabel('Number of Thefts')
# Observing a positive correlationText(0, 0.5, 'Number of Thefts')
top_12_areas_data['Description'].value_counts(normalize = True)
# Normalizing to see the percentages of total theft that each type of theft represents
# We can see that financial theft: theft over $500 and theft under $500 represents nearly 60% of total theftDescription
OVER $500 0.330373
$500 AND UNDER 0.280964
RETAIL THEFT 0.224122
FROM BUILDING 0.100814
POCKET-PICKING 0.051600
ATTEMPT THEFT 0.005985
PURSE-SNATCHING 0.005594
DELIVERY CONTAINER THEFT 0.000352
FROM COIN-OPERATED MACHINE OR DEVICE 0.000196
Name: proportion, dtype: float64
0.7.4 Analysis 3
By <Grace Shao>
0.7.4.1 Top 10 community areas with the most crime (Visualization 1)
# Plot
my_palette = ["#881177", "#aa3355", "#cc6666", "#ee9944", "#eedd00", "#99dd55", "#44dd88", "#22ccbb", "#00bbcc", "#0099cc", "#3366bb", "#663399"]
grace_palette = ["#c77464", "#2396be", "#967aa0", "#ffc858", "#d8a2e7", "#89c388", "#e6a1c6", "#6c6bc9", "#82c1c4"]
# List of top 10 most dangerous areas
top10 = crimesCTA['Name'].value_counts().head(10).to_frame()
top10Areas = top10.index.tolist()
# Top 10 most dangerous areas - all observations
top10Data = crimesCTA.loc[crimesCTA['Community Area'].apply(lambda x: x in top10Areas), :]
sns.set(font_scale=2)
g = sns.barplot(data=top10, x=top10['Name'], y = top10.index, label="Number of Crimes", orient='h', palette = grace_palette)
# Set label for x-axis
plt.xlabel("Number of Crimes Committed" , size = 30)
# Set label for y-axis
plt.ylabel("Community Area" , size = 30)
# Set title for figure
plt.title( "Total Amount of Crime by Community Area" , size = 30)
# Display figure
plt.show() 0.7.4.2 Map with crimes plotted to show high crime rates (Visualization 2)
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
fig = px.scatter_mapbox(crimesCTA,
lat="Latitude",
lon="Longitude",
hover_name="ID",
color="Community Area",
color_continuous_scale=px.colors.sequential.Rainbow,
zoom=11,
height=800,
opacity=0.5,
center=dict(lat=41.8725, lon=-87.6413),
width=800).update_traces(marker={"size": 15})
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show("notebook")0.7.4.3 Within The Loop: Most common types of CTA crime and area of station committed (Visualization 3)
plt.rcParams["figure.figsize"] = [12.00, 6]
plt.rcParams["figure.autolayout"] = True
area32 = crimesCTA.loc[crimesCTA['Name'].apply(lambda x: 'The Loop' in str(x)), :]
area32['Combined Description'] = area32["Primary Type"] + [' '] + area32["Description"]
area32Crime = area32['Combined Description'].value_counts().head(6).to_frame()
crimeTypes = area32Crime.index.to_list()
top6Crimes = area32.loc[area32['Combined Description'].apply(lambda x: x in crimeTypes), :]
top6Crimes['Combined Description'].value_counts()
trainPlatform = top6Crimes.loc[:, ['Location Description', 'Combined Description']]
trainPlatformCross = pd.crosstab(top6Crimes['Combined Description'],
top6Crimes['Location Description'])
# Plot
sns.set_palette("husl")
sns.set(font_scale=1.5)
g = trainPlatformCross.plot.barh(stacked = True, ylabel = 'Frequency of Crime', xlabel = 'Location of Crime', color = grace_palette)
g.set_yticklabels(['Simple Assault','Simple Battery','Strong Arm Robbery','Theft Under $500','Theft Over $500','Pick-pocketing'])
# Set label for x-axis
plt.xlabel("Number of Times Committed", size = 20)
# Set label for y-axis
plt.ylabel("Types of Crime", size = 20)
# Set title for figure
plt.title( "Types of Crime and Location Description" , size = 20)
# Display figure
plt.show()
plt.rcdefaults()0.7.4.4 Calculate top 3 stations with the most crime (Visualization 4)
stations = pd.read_csv('CTA_SYSTEM_INFO.csv')
stationsCoords = stations.loc[:, ['STOP_ID', 'STOP_NAME', 'STATION_DESCRIPTIVE_NAME', 'MAP_ID', 'Latitude', 'Longitude']]
crimesCTA_Stations = crimesCTA.copy()
# Using the Haversine formula, calculate closest subway station (longitude and latitude)
from math import cos, asin, sqrt
def distance(lat_crime, long_crime):
distance_list = []
p = 0.017453292519943295
for i in range(stationsCoords.shape[0]):
# Hav value = result of the Haversine Formula
hav = 0.5 - cos((lat_crime-stationsCoords['Latitude'][i])*p)/2 + cos(stationsCoords['Latitude'][i]*p)*cos(lat_crime*p) * (1-cos((long_crime-stationsCoords['Longitude'][i])*p)) / 2
# Append these distances into the list
distance_list.append(12742 * asin(sqrt(hav)))
minimum = distance_list.index(min(distance_list))
stationName = stations['STATION_NAME'][minimum]
return stationName
station_list = []
for i in range(crimesCTA_Stations.shape[0]):
if pd.isna(crimesCTA_Stations['Latitude'].iloc[i]) or pd.isna(crimesCTA_Stations['Longitude'].iloc[i]):
station_list.append('NaN')
else:
station = distance(crimesCTA_Stations['Latitude'].iloc[i], crimesCTA_Stations['Longitude'].iloc[i])
station_list.append(station)
station_list_frame = pd.DataFrame(station_list)
crimesCTA_withStation = crimesCTA_Stations.copy().reset_index()
crimesCTA_withStation['Station'] = station_list_frame
top3Stations = crimesCTA_withStation['Station'].value_counts().head(3).to_frame()g = sns.barplot(data=top3Stations, x='Station', y=top3Stations.index, label="Number of Crimes", palette = grace_palette)
plt.rcdefaults()
plt.rcParams["figure.figsize"] = [5, 3]
plt.rcParams["figure.autolayout"] = True
# Set label for x-axis
plt.xlabel("Number of Crimes Committed" , size = 15)
# Set label for y-axis
plt.ylabel("Station Name" , size = 15)
# Set title for figure
plt.title( "Top 3 Stations With the Most Crime" , size = 20)
# Display figure
plt.tight_layout()
plt.show() 0.7.4.5 Map with top 3 stations with the most crime (Visualization 5)
import plotly.io as pio
pio.renderers.default = "notebook"
conditions = (
crimesCTA_withStation['Station'].isin(['Jackson', 'Roosevelt', '95th/Dan Ryan'])
)
top3Stations_Data = crimesCTA_withStation[conditions]
color_scale = [(0, 'orange'), (1,'red')]
fig = px.scatter_mapbox(top3Stations_Data,
lat="Latitude",
lon="Longitude",
hover_name="ID",
color="Community Area",
color_continuous_scale=color_scale,
zoom=9.7,
height=280,
center=dict(lat=41.7955, lon=-87.6413),
width=400).update_traces(marker={"size": 10})
fig.update_layout(coloraxis_showscale=False)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show("notebook")0.7.4.6 Times of day where crime occurs for top 3 stations (Visualization 6)
import datetime
# Convert Timestamp to string format using strftime
top3Stations_Data['Date_String'] = top3Stations_Data['Date'].dt.strftime('%m/%d/%Y %I:%M:%S %p')
# Apply strptime to convert string back to datetime and extract hour
top3Stations_Data['Hour'] = top3Stations_Data['Date_String'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p')).dt.hour
# Drop the intermediate column 'Date_String' if not needed
top3Stations_Data.drop(columns=['Date_String'], inplace=True)
# Set the figure size
sns.set(rc={'figure.figsize': (10, 2)})
# Create the KDE plot using displot and store the plot object
plot = sns.displot(data=top3Stations_Data, x='Hour', kind='kde', col='Station')
# Set label for x-axis and y-axis
plot.set_axis_labels("Hour", "Frequency", fontsize = 30)
plot.set(xlim=(0, 24))
plt.xticks(range(0, 25, 3), labels=[str(i) for i in range(0, 25, 3)])
plot.set_xticklabels(fontsize=24)
plot.set_titles(fontsize=30)
plt.subplots_adjust(hspace=0, wspace=0.2)
plt.show()0.7.4.7 Calculating how much more dangerous Roosevelt is compared to the average station
totalCrimes = crimesCTA_withStation['Station'].value_counts().sum()
stationsWithoutRoosevelt = crimesCTA_withStation['Station'].nunique() - 1
rooseveltCrimeNumber = 194
averageCrimePerStation = (totalCrimes - rooseveltCrimeNumber) / stationsWithoutRoosevelt
timesMoreDangerous = rooseveltCrimeNumber / averageCrimePerStation
print("The Roosevelt station is", timesMoreDangerous, "times more dangerous than the average station.")The Roosevelt station is 7.868409193330329 times more dangerous than the average station.
0.7.5 Analysis 4
By <Paisley Lucier>
0.7.5.1 1st visualization
# Visualizing proportion of crimes resulting in arrest across side
grouped_sides = arrest_data.groupby(['Side'])
grouped_sides_m = grouped_sides.mean()
grouped_sides_m| ID | Arrest | Domestic | Beat | District | Ward | Community Area | X Coordinate | Y Coordinate | Year | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Side | ||||||||||||
| Central | 1.273337e+07 | 0.107993 | 0.164096 | 813.361489 | 7.901587 | 18.560827 | 38.129553 | 1.169315e+06 | 1.883294e+06 | 2022.0 | 41.835260 | -87.654221 |
| North | 1.274061e+07 | 0.131305 | 0.166681 | 1798.788545 | 17.753468 | 35.142533 | 19.727479 | 1.151533e+06 | 1.919441e+06 | 2022.0 | 41.934816 | -87.718521 |
| South | 1.270866e+07 | 0.108861 | 0.260033 | 791.395471 | 7.690781 | 14.362799 | 57.593862 | 1.178853e+06 | 1.846025e+06 | 2022.0 | 41.732781 | -87.620367 |
ax = sns.barplot(data=grouped_sides_m, x=grouped_sides_m.index, y='Arrest')
ax.set_title("Percentage of Arrested Observations by Side",
fontdict = {'fontsize': 20,
'fontweight': 'bold',
'color': '#704434'})
plt.show()0.7.5.2 2nd visualization
# to ensure that I am not comparing districts with almost 0 observances for a particular crime type,
# I subset the top 12 crimes with the most overall observations1 be sure to show where this is coming from !!
# Subset top 12 crimes
top_12_crimes = arrest_data.loc[((arrest_data['Primary Type'] == 'THEFT') | \
(arrest_data['Primary Type'] == 'BATTERY') | \
(arrest_data['Primary Type'] == 'CRIMINAL DAMAGE') | \
(arrest_data['Primary Type'] == 'MOTOR VEHICLE THEFT') | \
(arrest_data['Primary Type'] == 'ASSAULT') | \
(arrest_data['Primary Type'] == 'DECEPTIVE PRACTICE') | \
(arrest_data['Primary Type'] == 'OTHER OFFENSE') | \
(arrest_data['Primary Type'] == 'ROBBERY') | \
(arrest_data['Primary Type'] == 'WEAPONS VIOLATION') | \
(arrest_data['Primary Type'] == 'BURGLARY') | \
(arrest_data['Primary Type'] == 'NARCOTICS') | \
(arrest_data['Primary Type'] == 'CRIMINAL TRESPASS')), :]# Group observations by side, get average arrest rate for each top 12 crime type by side
grouped_side_type = top_12_crimes.groupby(['Side', 'Primary Type'])
grouped_side_type_m = grouped_side_type.mean().loc[:, 'Arrest']
grouped_side_type_m = grouped_side_type_m.reset_index() ax = sns.barplot(x='Primary Type',y='Arrest', hue='Side', data = grouped_side_type_m)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90, fontdict = {
'color': '#704434'})
ax.set_title("Arrest by Type of Crime (top 12 crimes)",
fontdict = {'fontsize': 20,
'fontweight': 'bold',
'color': '#704434'})
ax.set_xlabel('Primary Type', fontsize=20)
ax.set_ylabel('Arrest Proportion', fontsize=20)
plt.show()1.0.0.1 3rd visualization
# Group by district
sentiments_2022_m = sentiments_2022.groupby(['DISTRICT']).mean()
sentiments_2022_m = sentiments_2022_m.reset_index()
# Add side column
sentiments_2022_m['Side'] = sentiments_2022_m['DISTRICT']
sentiments_2022_m['Side'] = sentiments_2022_m['Side'].map({4: 'South', 5: 'South', 6: 'South', 7: 'South', 22: 'South', 31: 'South', \
11: "North", 14: 'North', 15: 'North', 16: 'North', 17: 'North', 19: 'North', 20: 'North',\
24: 'North', 25: 'North', 1: 'Central', 2: 'Central', 3: 'Central', 8: 'Central', \
9: 'Central', 10: 'Central', 12: 'Central', 13: 'Central', 18: 'Central'})# Group by side
sentiments_2022_m_side = sentiments_2022_m.groupby(['Side']).mean()
sentiments_2022_m_side = sentiments_2022_m_side.reset_index()
sentiments_2022_m_side.drop(columns='DISTRICT', inplace=True)sentiments_2022_m_side_melted = pd.melt(sentiments_2022_m_side,id_vars = ['Side'],var_name = 'Survey_q',value_name = 'Score')ax = sns.barplot(data=sentiments_2022_m_side_melted, x='Survey_q', y='Score', hue='Side')
ax.set_title("Police Sentiments Rating by Side",
fontdict = {'fontsize': 20,
'fontweight': 'bold',
'color': '#704434'})
plt.legend(loc='lower left')
plt.show()1.0.0.2 4th visualization
sentiments_arrest_dis = sentiments_2022_m.copy()
sentiments_arrest_dis = sentiments_arrest_dis.set_index('DISTRICT')
grouped = pd.DataFrame(top_12_crimes.groupby(['District', 'Primary Type']).mean().loc[:, 'Arrest'])
sentiments_arrest_dis.corr()| SAFETY | TRUST | T_RESPECT | T_LISTEN | |
|---|---|---|---|---|
| SAFETY | 1.000000 | 0.915864 | 0.910727 | 0.914462 |
| TRUST | 0.915864 | 1.000000 | 0.996664 | 0.995943 |
| T_RESPECT | 0.910727 | 0.996664 | 1.000000 | 0.985276 |
| T_LISTEN | 0.914462 | 0.995943 | 0.985276 | 1.000000 |
# since these are highly correlated, I will aggregate to one score: average sentiment rating
sentiments_arrest_dis['avg_sent'] = (sentiments_arrest_dis['SAFETY'] + sentiments_arrest_dis['TRUST'] + \
sentiments_arrest_dis['T_RESPECT'] + sentiments_arrest_dis['T_LISTEN']) / 4
sentiments_arrest_dis = sentiments_arrest_dis.drop(columns=['SAFETY', 'TRUST', 'T_RESPECT', 'T_LISTEN'])
sentiments_arrest_dis| Side | avg_sent | |
|---|---|---|
| DISTRICT | ||
| 1.0 | Central | 60.353594 |
| 2.0 | Central | 56.393281 |
| 3.0 | Central | 50.658646 |
| 4.0 | South | 52.999375 |
| 5.0 | South | 51.750729 |
| 6.0 | South | 50.812396 |
| 7.0 | South | 50.670885 |
| 8.0 | Central | 60.179896 |
| 9.0 | Central | 57.743646 |
| 10.0 | Central | 53.744583 |
| 11.0 | North | 51.363542 |
| 12.0 | Central | 57.304740 |
| 14.0 | North | 59.580156 |
| 15.0 | North | 52.271458 |
| 16.0 | North | 68.140333 |
| 17.0 | North | 62.995573 |
| 18.0 | Central | 64.994583 |
| 19.0 | North | 62.284323 |
| 20.0 | North | 59.938229 |
| 22.0 | South | 59.351146 |
| 24.0 | North | 59.805990 |
| 25.0 | North | 58.433021 |
grouped = pd.DataFrame(top_12_crimes.groupby(['District', 'Primary Type']).mean().loc[:, 'Arrest'])
def merge_sentiments(x):
x = pd.merge(left=x, right=sentiments_arrest_dis.reset_index(), left_on='District', right_on='DISTRICT')
return x
grouped_new = grouped.reset_index().groupby('Primary Type').apply(merge_sentiments)
a = sns.FacetGrid(grouped_new, \
col = 'Primary Type',col_wrap=3,height=3.5,aspect = 1)
a.map_dataframe(sns.regplot,x="avg_sent", y="Arrest")
plt.show()# binning the two types of crime with the most visible associations and graphing those for a closer look
grouped_new['Primary Type Binned'] = "None"
def bin_it(x):
if (x['Primary Type'] == 'ASSAULT' or x['Primary Type'] == 'BATTERY'):
x['Primary Type Binned'] = "Assault + Battery"
elif (x['Primary Type'] == 'BURGLARY' or x['Primary Type'] == 'ROBBERY' or x['Primary Type'] == 'THEFT'):
x['Primary Type Binned'] = "Burglary + Robbery + Theft"
else:
x['Primary Type Binned'] = x['Primary Type']
return x
grouped_new = grouped_new.apply(bin_it, axis=1)
grouped_new = grouped_new.drop(columns = 'Primary Type')
grouped_new.reset_index()
grouped_new.groupby('Primary Type Binned')
a = sns.FacetGrid(grouped_new.loc[(grouped_new['Primary Type Binned']== 'Assault + Battery') \
| (grouped_new['Primary Type Binned'] == 'Burglary + Robbery + Theft')], \
col = 'Primary Type Binned',col_wrap=3,height=3.5,aspect = 1)
a.map_dataframe(sns.regplot,x="avg_sent", y="Arrest")
plt.show()1.0.0.3 5th Visualization
grouped = pd.DataFrame(top_12_crimes.groupby(['Primary Type', 'District']).mean().loc[:, 'Arrest'])
grouped_reset = grouped.reset_index()
disparity_df = np.ones([grouped_reset['Primary Type'].nunique(), 4])
disparity_df = pd.DataFrame(disparity_df, columns=['Primary Type', 'District of Max Arrest Proportion', \
'District of Min Arrest Proportion', 'Difference'])
disparity_df['Primary Type'] = grouped_reset['Primary Type'].value_counts().index
disparity_df = disparity_df.set_index('Primary Type')
def get_disparity(x):
maxx = x.Arrest.max()
minn = x.Arrest.min()
disparity_df.loc[x.loc[x['District'] == 1, 'Primary Type'], 'District of Max Arrest Proportion'] = x.loc[x.Arrest.idxmax(), 'District']
disparity_df.loc[x.loc[x['District'] == 1, 'Primary Type'], 'District of Min Arrest Proportion'] = x.loc[x.Arrest.idxmin(), 'District']
disparity_df.loc[x.loc[x['District'] == 1, 'Primary Type'], 'Difference'] = maxx - minn
grouped_reset.groupby('Primary Type').apply(get_disparity)
disparity_df.sort_values(by = 'Difference', ascending = False)| District of Max Arrest Proportion | District of Min Arrest Proportion | Difference | |
|---|---|---|---|
| Primary Type | |||
| WEAPONS VIOLATION | 18.0 | 17.0 | 0.516396 |
| CRIMINAL TRESPASS | 16.0 | 22.0 | 0.449882 |
| OTHER OFFENSE | 14.0 | 3.0 | 0.272351 |
| ROBBERY | 16.0 | 5.0 | 0.134063 |
| NARCOTICS | 19.0 | 6.0 | 0.116667 |
| ASSAULT | 1.0 | 12.0 | 0.085086 |
| BURGLARY | 18.0 | 4.0 | 0.083151 |
| BATTERY | 1.0 | 7.0 | 0.078671 |
| THEFT | 1.0 | 7.0 | 0.072987 |
| CRIMINAL DAMAGE | 16.0 | 2.0 | 0.036573 |
| DECEPTIVE PRACTICE | 22.0 | 25.0 | 0.028950 |
| MOTOR VEHICLE THEFT | 20.0 | 9.0 | 0.028729 |
1.1 Other sections
Put code with comments. The comments should explain the code such that it can be easily understood. You may put text (in a markdown cell) before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. Put the name of the person / persons who contributed to each code chunk / set of code chunks.
Put each model in a section of its name and mention the name of the team-member tuning the model. Below is an example:
1.2 Conclusions and Recommendations to stakeholder(s)
You may or may not have code to put in this section. Delete this section if it is irrelevant.